Final Project

We would want to explore the data we received by the 5 core activities of data analysis that we learned in class.

Our question is of type exploratory - we would want to check if there are some relations between the different data sets we received.

We received the data sets:

Our first assignment is to explore the data, so we would want to start with that.
We will explore the data by doing some statistical analysis and plots for a better understanding, and for last we will try to present the results as best as we can.

Installations

Note: We have turned off the comments issued by the chunk, because they are only prints of the system and irrelevant

# install.packages("rtweet")
# install.packages("tidytext")
# install.packages("robotstxt")
# install.packages("lexicon")
# install.packages("stringr")
# install.packages("reshape")
#install.packages("multipanelfigure")

Libraries

Note: We have turned off the comments issued by the chunk, because they are only prints of the system and irrelevant.

library(reshape)
library(tidyverse)
library(robotstxt)
library(rvest)
library(magrittr)
library(lubridate)
library(tidymodels)
library(tidytext)
library(textrecipes)
library(glue)
library(countrycode)
library(plotly)
library(tools)
library(broom)
library(ggridges)
library(corrplot)

Q1

Scraping the data and organize it

Here is a link to the site from which the information on the corona cases was taken
https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data/United_States_medical_cases

First we check if the site allows us to access the table and perform web-scraping

paths_allowed("https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data/United_States_medical_cases")
## [1] TRUE

Note: We have turned off the comments issued by the chunk, because they are only prints of the system and irrelevant.


We will now perform web-scraping in order to extract thr table from the web

# wiki_url <- 'https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data/United_States_medical_cases'
# wiki_page <- read_html(wiki_url)
# Wiki_Cases <- wiki_page%>%
#   html_nodes ('#mw-content-text > div.mw-parser-output > div:nth-child(4) > table > tbody')%>%
#   html_table()
# Wiki_Cases <- as.data.frame(Wiki_Cases) # save the table as data frame
# Wiki_Cases
# saveRDS(Wiki_Cases, file = "data\\Wiki_Cases.rds")


Now we want to change the type of columns, it can be seen from the data frame that confirmed cases is chr and not numeric.
We will create a function to organize the data frame.

Wiki_Cases<- readRDS("data\\Wiki_Cases.rds")
country<- unlist(Wiki_Cases[1,]) # save vector first row for future use
Wiki_Cases <- Wiki_Cases %>%
  filter(!Date %in% c('Date','Total','Refs','Notes'))%>%
  # fiter for removing the header of the table which appears many times inside the table
  mutate(Date=dmy(Date),Date.1=dmy(Date.1))  # mutate function for set column date as type date


Function to convert all columns from char type to dbl type

casting_dbl<- function(df){
  columns_name <- colnames(df) # get column names of df as vector to go over by iterations 
  for(i in 1:length(columns_name))
    {
    # apart from date column which we handle earlier, each column type char convert to dbl 
    if(typeof(df[, columns_name[i]])=="character") 
      {df[,columns_name[i]]<- as.numeric(gsub(",", "", df[,columns_name[i]]))}
    # we use gsub function for removing coma from numbers
    else{next}
    }
  df[is.na(df)] = 0 # all places with NA will have now zeros
  return(df)
}


Function to convert change columns name

Brief description of the function :
1) Use a dictionary from the web which converts abbreviations of states to their names.
2) Replace spaces between two word with “_”
3) Change the name of some columns which have at the origin table sub columns


Here is a link for the origin of abbreviations dictionary ‘https://bit.ly/2ToSrFv

change_columns_names <- function(df , country){
  
        dict_abbrev<- read.csv('https://bit.ly/2ToSrFv')
        abv_country<- gsub(" ", "_", head(setdiff(country, colnames(df)),-2)) # extract difference 
        for (i in 1:length(abv_country)) 
        {if(abv_country[i] %in% dict_abbrev$abbreviation) # use %in% to look for country at abbreviations dict
          {abv_country[i]<- countrycode::countrycode(abv_country[i], 'abbreviation', 'state',custom_dict = dict_abbrev)}
        else{next}}
        
        abv_country<-gsub(" ", "_",abv_country) 
        search_for<- c('Date','Territories','Confirmed','Deaths','Recovered','Active') # columns with sub columns
        column_name <- colnames(df)
        res <- vector(mode="character", length=length(column_name)) # create new vector
        for (j in 1:length(search_for)) 
          {
          starts <-startsWith((column_name), search_for[j]) # use start with function to find name start with the same name
          for (i in 1:length(starts)) 
            {
            if(starts[i] == TRUE )
            {res[i] <-search_for[j]} # if TRUE means we find a match and we want to set the value at new vector 
          }
        }
        # now putting it all together
        for (i in 2:length(res)){
          if(res[i] == "") # if value is "" we didn't find a match so take from abv_country
          {res[i]<-paste(res[i], abv_country [i-1], sep = "")}
          
          else{res[i]<-paste(res[i], country[i], sep = "_")} # else append sub name column 
        }
        colnames(df)<-res
        return (df)
}


Call all function to clean table

Now we will apply all function together, to get a fresh new clean table.
Note: we found that the table also contains territories and city which not appear at dict_abbrev will append the separately

clean_data<- function(df){
  country<- case_when(
          country=="DC" ~ "Washington_D.C",
          country=="GU" ~ "Guam",
          country=="MP" ~ "Northern_Mariana_Islands",
          country=="PR" ~ "Puerto_Rico",
          country=="VI" ~ "Virgin_Islands",
          TRUE ~ country)
  
  df_casting_dbl<-casting_dbl(df)
  df_change_columns<- change_columns_names(df_casting_dbl, country)
  return (df_change_columns)
}
Cases<- clean_data(Wiki_Cases)
Cases
# cases is sorted by columns names so we take region by [:]
West<-colnames(Cases)[2:14]
Midwest<-colnames(Cases)[15:27]
South<-colnames(Cases)[28:40]
Northeast<-colnames(Cases)[41:52]
Territories<-colnames(Cases)[53:56]


Exploration

We will have a glimpse over the data for a better understanding.

glimpse(Cases)
## Rows: 485
## Columns: 64
## $ Date                                 <date> 2020-01-21, 2020-01-24, 2020-01-~
## $ Alaska                               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Arizona                              <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ~
## $ California                           <dbl> 0, 0, 1, 1, 0, 1, 0, 3, 0, 0, 2, ~
## $ Colorado                             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Hawaii                               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Idaho                                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Montana                              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ New_Mexico                           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Nevada                               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Oregon                               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Utah                                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Washington                           <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Wyoming                              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Iowa                                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Illinois                             <dbl> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, ~
## $ Indiana                              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Kansas                               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Michigan                             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Minnesota                            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Missouri                             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ North_Dakota                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Nebraska                             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Ohio                                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Oklahoma                             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ South_Dakota                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Wisconsin                            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ~
## $ Alabama                              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Arkansas                             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Florida                              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Georgia                              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Kentucky                             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Louisiana                            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Mississippi                          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ North_Carolina                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ South_Carolina                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Tennessee                            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Texas                                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Virginia                             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ West_Virginia                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Connecticut                          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Washington_D.C                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Delaware                             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Massachusetts                        <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ~
## $ Maryland                             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Maine                                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ New_Hampshire                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ New_Jersey                           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ New_York                             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Pennsylvania                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Rhode_Island                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Vermont                              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Territories_Guam                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Territories_Northern_Mariana_Islands <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Territories_Puerto_Rico              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Territories_Virgin_Islands           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Date_Date                            <date> 2020-01-21, 2020-01-24, 2020-01-~
## $ Confirmed_Daily                      <dbl> 1, 1, 1, 2, 1, 1, 1, 3, 1, 0, 2, ~
## $ Confirmed_Total                      <dbl> 1, 2, 3, 5, 6, 7, 8, 11, 12, 12, ~
## $ Deaths_Daily                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Deaths_Total                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Recovered_Daily                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 2, ~
## $ Recovered_Total                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 5, ~
## $ Active_Total                         <dbl> 1, 2, 3, 5, 6, 7, 8, 11, 12, 9, 9~

We can see that the cases data frame has 2 different data frames:

We noticed there are only 2 data types:


Separating to different data frames.

For start We would want to separate the 2 data frames for easier working.
After that, we would want to do some calculations over the regions of the states, so we will create separated data frames for each region, and a data frame of the states cases by each region.

index1 <- c(1, 58:64)
Total_Df <- Cases %>% 
  select(all_of(index1))

index2 <- c(1:56)
Cases_States <- Cases %>% 
  select(all_of(index2))

We now have two data frames with different roles.


Let’s check the dimensions of the df.

dim(Cases_States)
## [1] 485  56
dim(Total_Df)
## [1] 485   8

All the data frames have 484 rows that represent the date of cases.
The cases data frame have 56 variables that represent the states and territories.
The total_df data frame has 8 variables that represents total amount of cases.
We know that the rows represent data that was collected in each day , so we will check the top and bottom of the data to understand when it starts and when it ends.


Lets see the top and bottom of the cases states data frame

Cases_States <- Cases_States %>% 
  arrange(Date) 

head(Cases_States, 5)
tail(Cases_States, 5)

First, we used the arrange function to arrange by the date, after we can see our data starts at the 21.1.20 and ends at the 11.6.21.


Creating data frames for region

We want to create data frame representing the regions so we have a look over the data from another angle.

Region_Cases <- Cases %>% 
  mutate(Total_West = rowSums(.[2:14]), Total_Midwest = rowSums(.[15:27]),
         Total_South = rowSums(.[28:40]), Total_Northeast = rowSums(.[41:52]),
         Total_Territories = rowSums(.[53:56])) %>% 
  select(Date, Total_West, Total_Midwest , Total_South, Total_Northeast,
         Total_South, Total_Northeast, Total_Territories)


Cleaning the data

Now that we organized the data we would want to check if there are incorrect values in the data frame.

First, we will use the filter function to check for negative values.

Cases_States %>% 
    filter_all(any_vars(.<0))

We see that we have 32 rows containing negative values.

we know that the data represent cases that occurred in the states, so none of them should be negative.

At the beginning we thought the negative values are incorrect and we should replace them to positive values, but after studding the reason for the negative values we understand that they represent over count.

We planed to do statistical calculation over the data, so a better choice would be to leave the negative values. Also we would want to present some insights about specific days, so it might be better to change the negative values to 0.

We decided to leave the negative values in the part that we do the statistical analysis and after change the values to 0.


Statistical analysis

We would want to do some statistical analysis over the data for a better understanding.

We will create a function that will help us with those calculations.

stats <- function(data){
    mean_data <- mean(data, na.rm = T)
    std_data <- sd(data ,na.rm = T)
    med_data <- median(data, na.rm = T)
    max_data <- max(data, na.rm = T)
    data_insigts <- c(mean_data,std_data,med_data,max_data) 
    return (data_insigts)
}

The function will calculate Mean, Std, Median and Max value for each vector.


Analysis for states cases data frame

We would want to calculate statistic analysis for the cases data frame.

We used the map_df function for making the code shorter.

#organize df for map function
Stats_Df <- Cases_States %>% 
  select(all_of(2:56))
#use stats function over all the data with map
Stats_Data <-map_df(Stats_Df, stats)
#Setting Estimate vector
Estimate <- c("Mean","Std","Median","Max")

#setting stats_data df
Stats_Data <- Stats_Data %>% 
  mutate("Estimate"= Estimate) %>% 
  relocate(Estimate)

Stats_Data


With those calculations we can have some insights: Let’s see which state has the highest number of cases in a single day

colnames(Stats_Data)[apply(Stats_Data,1,which.max)][1]
## [1] "California"
Stats_Data %>%
  summarise_if(is.numeric,max) %>%
  max()
## [1] 53711

It might be because California population is very large.
But let’s not assume things before we check the proportion of it’s population compeare with other states

  • We can explore the average number of cases in each state.

  • We can understand in which states the corona was stable and in which wasn’t by the Std value - we have an assumption that in states with high std compared to the population count the corona affected the population in multiple waves.

  • We noticed that corona virus almost didnt affect the Northen Mariana Islands as the average number of cases per day is less then 1 and the maximum number of cases per day is 8.


We want to see the data from a different perspective, so we created a box plot representing the region data.

 region_box_plot <- Region_Cases %>% 
  gather(key,value,-Date) %>% 
  ggplot(aes(x = reorder(key,value), y = value, fill = key)) +
  geom_boxplot(notch = TRUE) +
  scale_fill_manual(values=c("#66CC99", "#E69F00", "#56B4E9", "#293352","#CC6666")) + 
  labs(x = "Regions", y = "Value", title = "Box plot of total cases", subtitle = "Sum by each region and territory", fill = "Regions")
ggplotly(region_box_plot) %>% 
     layout(title = list(text = paste0('Box plot of total cases','<br>','<sup>','Summed by each region and territory','</sup>')))

Graph explanation:

  • South region had the highest cases

  • Territories area had the Least number of cases - an assumption would be because they are islands and it was easier for them to not allow entry to their areas.

  • Median , q1 , q3 top and bottom cases per day.


Adding a Population data set

Because the number of confirmed cases differs between states, where one of the reasons is the size of the potential population to be infected.
We normalized the total number of confirmed cases by th state population in it.
Here is a link for the source of the data set “https://www.infoplease.com/us/states/state-population-by-rank

We will now perform web-scraping in order to extract the table from the web

population_url <-"https://www.infoplease.com/us/states/state-population-by-rank"
paths_allowed(population_url)
## [1] TRUE

Note: We have turned off the comments issued by the chunk, because they are only prints of the system and irrelevant.


# Population_Df <- read_html(population_url)
# Population_Df <- Population_Df %>%
#   html_nodes ('#A0922497 > tbody')%>%
#   html_table()%>%
#   as.data.frame()%>%
#   mutate(Percent_of_Total= as.numeric(gsub("%", "", Percent.of.Total)),# replaced the percentages symbol %.
#   State= gsub(" ", "_", State), "Size_Population"= as.numeric(gsub(",", "", July.2019.Estimate))) %>%
#   select(Rank,State,Percent_of_Total,Size_Population)

# saveRDS(Population_Df, file = "data\\Population_Df.rds")


We would like to have a quick look at the data we will use glimpse

Population_Df<- readRDS("data\\Population_Df.rds")
glimpse(Population_Df)
## Rows: 52
## Columns: 4
## $ Rank             <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16~
## $ State            <chr> "California", "Texas", "Florida", "New_York", "Illino~
## $ Percent_of_Total <dbl> 11.91, 8.74, 6.47, 5.86, 3.86, 3.82, 3.52, 3.20, 3.16~
## $ Size_Population  <dbl> 39512223, 28995881, 21477737, 19453561, 12671821, 128~

We can see that we have:

  • State’s name column.

  • Rank column - this column is an indexing that organize the states by the size of population, where as the state with the highest population size, California, ranked by 1.

  • Percent of the state’s population compare to the total population in the US.

  • Population count for each state.


We will use the head function to see the data

head(Population_Df, n = 10)


Note: we added a function that scrapes the territories data from the internet by web scrapping, because we didn’t have it in the first population.

fetch_population <- function(selector){
  Territories_pop_url <- 'https://www.nationsonline.org/oneworld/US-states-population.htm'
  population <- read_html(Territories_pop_url)%>% 
  html_nodes (selector)%>%
  html_text()
  gsub(',','',population)%>%as.numeric()
}


Guam_selector <- 'body > div:nth-child(62) > div > div > div.ContCellPop'
Northern_Mariana_Islands_selector <- 'body > div:nth-child(63) > div > div > div.ContCellPop'
Puerto_Rico_selector <- 'body > div:nth-child(64) > div > div > div.ContCellPop'
Virgin_Islands_selector <- 'body > div:nth-child(65) > div > div > div.ContCellPop'


selectors <- c(Guam_selector,Northern_Mariana_Islands_selector,Puerto_Rico_selector,Virgin_Islands_selector)
get_population <- map_dbl(selectors,fetch_population)

Territor_pop_df <- tibble(
    State = Territories,
  Size_Population = get_population)

Territor_pop_df
Pop_Df <- Population_Df %>%
  select(State,Size_Population)
Pop_Df <- rbind(Pop_Df,Territor_pop_df)
Pop_Df


We built another 2 function for creating normalized cases count.

Description of the functions :

  1. append_factor - adding for each state or area another column representing region.

  2. apply_norm - function for creating the normalized cases count.

append_factor<-function(df){
  res<-as.vector(df$State)
  for(i in 1:length(res)){
  if (res[i] %in% West){res[i]="West"}
  if (res[i] %in% Midwest){res[i]="Midwest"}
  if (res[i] %in% South){res[i]="South"}
  if (res[i] %in% Northeast){res[i]="Northeast"}
  if (res[i] %in% Territories){res[i]="Territories"}
  }
  df<-df %>% mutate(Region=res)
  return(df)
}

apply_norm <- function(df1,df2){
  new_df<- df1 %>%
  select_if(is.numeric) %>% map_df(sum) %>%
     pivot_longer(cols = everything(),names_to = "State", values_to = "Count") %>% 
        left_join(df2, by = "State") %>%
    mutate(Normed = round(Count/Size_Population,digits=3)) %>% 
    drop_na()
    new_df<-append_factor(new_df)
    return(new_df)
}

Normed_Cases <-apply_norm(Cases_States, Pop_Df)

Note: While using the function we removed Washington DC. because it isn’t a state or territory.


Normed_Cases[which.max(Normed_Cases$Size_Population), ]

** Coming back to our understand that the amount of confirmed cases was the highest at US was at California, now we can be sure that in fact base on another data set that California has the maximum inhabitants at USA. **


Plots

Bar plot representing the cases normalized to population size by each region

We would plot the sum of cases in each state normalized by their population size and present them while we use the facet wrap function for separating for each region.

ggplot(data= Normed_Cases, aes(x =Normed , y = reorder(State,Normed) , fill = Region))+
  geom_bar(stat = "identity", color = "black", position="stack") +
  scale_x_continuous(labels=scales::percent) +
  scale_fill_manual(values=c("#66CC99", "#E69F00", "#56B4E9","#CC6666", "#293352"))+
  facet_wrap(~Region, ncol = 2,scales = "free") +

  geom_vline(data=Normed_Cases, aes(xintercept=mean(Normed), color=Region),
             color="red", size=0.8)+
  
  labs(x = "Cases ratio", y = " ", title = "Bar plot of the cases ratio in each state", subtitle = "Divided to each region and normalized by population count", fill = "Region")

Graph explanation:

  • The states with the highest ratio of cases compared to population are:

    • Midwest region - North Dakota

    • Northeast region - Rhode Island

    • West region - Utah

    • South region - Tennessee

  • In most of the states more then 5% of the population was affected by the corona.

  • Compared to the other regions the Midwest region population was affected the most.

  • Hawaii is the state the was affected at the lowest level compared to all the states.

  • We added a red line representing the mean normalized cases count for an observation for which states are above or under the mean.


Exploration for total cases df

We would want to present a plot that shows the change by count of each column that represent total counting in the total df. We will create a line plot that represent the changes over time for the total count normalized by the total population count.

For start lets have a glimpse over the df and explain the columns that we have

glimpse(Total_Df)
## Rows: 485
## Columns: 8
## $ Date            <date> 2020-01-21, 2020-01-24, 2020-01-25, 2020-01-26, 2020-~
## $ Confirmed_Daily <dbl> 1, 1, 1, 2, 1, 1, 1, 3, 1, 0, 2, 0, 1, 0, 4, 5, 18, 15~
## $ Confirmed_Total <dbl> 1, 2, 3, 5, 6, 7, 8, 11, 12, 12, 14, 14, 15, 15, 19, 2~
## $ Deaths_Daily    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 4, ~
## $ Deaths_Total    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 6, ~
## $ Recovered_Daily <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 2, 1, 0, 0, 0, 1, 2, 0, ~
## $ Recovered_Total <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 5, 6, 6, 6, 6, 7, 9, 9, ~
## $ Active_Total    <dbl> 1, 2, 3, 5, 6, 7, 8, 11, 12, 9, 9, 8, 9, 9, 13, 16, 31~

Note: we explained the columns while we glimpsed cases df.


We would want to add the data frame some new variables that can help us in further calculations, We will add: total and daily ratio for death and recovery cases.

New_Total_Df <-Total_Df %>% filter(Confirmed_Daily>0) %>% 
  mutate(Total_Death_Ratio =Deaths_Total/Confirmed_Total,
         Daily_Death_Ratio =Deaths_Daily/Confirmed_Daily,
         Total_Recover_Ratio = Recovered_Total/Confirmed_Total,
         Daily_Recover_Ratio = Recovered_Daily/Confirmed_Daily) 
New_Total_Df 


Plot:

plot_ly(data= New_Total_Df %>% 
  gather(key,value,Deaths_Total,Confirmed_Total,Active_Total,Recovered_Total)) %>%

  add_trace(
      x = ~ Date,
      y = ~  value/sum(Pop_Df$Size_Population), #calculating total population
      color = ~ key,
      type = "scatter",
      mode = "lines",
      text = ~paste("Date: ", Date,
                    "</br>Accumulate number  : ", value,
                    "</br>Total_Population: ", sum(Pop_Df$Size_Population)),
      name = ~key) %>% 

      layout(title = paste("Line plot representing the ratio of active, confirmed, deaths and recoverd cases",
                       '<br>', "Normalized by the number of population in the US"),
             yaxis = list(title = "Ratio",  range = c(0, 0.05)), 
             xaxis = list(title = "Date"),
             legend = list(x = 0.99, y = 1))

Graph explanation:

  • We can see that the confirmed cases reached over 4% of the population.

  • The deaths ratio is less then 1%.

  • The recovery rate that change almost like the confirmed ratio.

  • the active cases ratio time line that has a peak between 1/1/21 to the 1/7/21.


Civiqs poll data frame exploration

We would start by loading the csv file

Civiqs_Poll<- read_csv('data\\civiqs_poll.csv')  # open file 
if("date" %in% colnames(Civiqs_Poll)==TRUE)
  {Civiqs_Poll$date <-as.Date(Civiqs_Poll$date, format = "%m/%d/%Y")} # cast date col from chr type to date type

Civiqs_Poll<- Civiqs_Poll %>% 
  rename_with(str_to_title) # convert capital letter in each cols name
Civiqs_Poll


Let’s have glimpse over the data for a better understanding

glimpse(Civiqs_Poll)
## Rows: 41
## Columns: 4
## $ Date <date> 2020-02-25, 2020-02-26, 2020-02-27, 2020-02-28, 2020-02-29, 2020~
## $ Dem  <dbl> 35, 35, 35, 35, 36, 37, 37, 37, 38, 39, 39, 41, 43, 46, 49, 52, 5~
## $ Rep  <dbl> -57, -58, -58, -58, -58, -58, -57, -56, -56, -56, -56, -56, -54, ~
## $ Diff <dbl> 92, 93, 93, 93, 94, 95, 94, 93, 94, 95, 95, 97, 97, 97, 97, 96, 9~

We have 4 columns which contains:

  • Date.

  • Democrat’s concerns presented in percentage.

  • Republicans concerns presented in percentage .

  • The difference in concern between both parties calculated by the formula: Diff = Dem - Rep.

We noticed from the glimpse function that the concern value doesn’t start from 0, it starts for Rep column from -48 and for Dem column at 35.


Lets see the top and bottom of the data frame

Civiqs_Poll %>% 
  head()
Civiqs_Poll %>% 
  tail()

We see that the data we have starts at the 2020-02-25 and ends at the 2020-04-05, so we have a little bit more than a month of the concern ratio of both parties.


lets see the dimension.

dim(Civiqs_Poll)
## [1] 41  4

We have 41 rows of data meaning 41 days.


We would want to plot the concern of both parties to have a better look.

plot_ly(data= Civiqs_Poll) %>% 
  add_trace(
      x = ~ Date,
      y = ~  Dem,
      type = "scatter",
      mode = "lines+markers",
      line = list(color="steelblue"),
      marker = list(color="steelblue"),
      hoverinfo = 'text',
      text = ~paste("</br>Dem:", Dem,
                    "</br>Date:", Date),
      text = ~paste("Date: ", Date),
      name = "Democrats") %>% 
    
    add_trace(
      x = ~ Date,
      y = ~  Rep,
      type = "scatter",
      mode = "lines+markers",
      line = list(color="red"),
      marker = list(color="red"),
      hoverinfo = 'text',
      text = ~paste("</br>Rep:", Rep,
                    "</br>Date:", Date),
      name = "Republicans") %>% 

      layout(title = paste("Concern ratio in each Party",
                       '<br>', "Between dates 2020-02-25 to 2020-04-05"),
             yaxis = list(title = "Concern (%)",ticksuffix = "%", range= c(-70, 100)), 
             xaxis = list(title = "Date"),
             legend = list(x = 0.99, y = 1),
             hovermode = "compare")

Note: We have turned off the comments issued by the chunk, because they are only prints of the system and irrelevant.

Graph explanation:

As we can see the scales goes from a minus value to 100%, we can understand that the scales goes from -100% meaning no concern to 100% meaning a full concern. We noticed that we have an increase in concern by both parties that starts at the 7-8/3/2020 for 2 weeks more or less. Lets try to understand why. While exploring the cases data frame we noticed we have data about deaths cases that happened during the pandemic, we believe the death rate by the virus might affect the concern level of both parties as it affects both of them. So we would want to calculate the death rate in the same time range of the concern time range. We will calculate it by using the formula: death rate = death cases / total cases.


Graph showing the diff concern with changes during the plague

We would like to link our two data sets, and explore if the increase in the number of new cases per day and in the number of daily deaths can explain the difference in the concern of the Democratic and Republican parties.

plot_ly(data= Cases %>% filter(Date<="2020-04-05" & Date >="2020-02-25")) %>%
  
  add_trace(
      x = ~ Date,
      y = ~ Confirmed_Daily/300,
      type = "scatter",
      mode = "lines+markers",
      hoverinfo = 'text',
      text = ~paste("</br>Confirmed_Daily:", Confirmed_Daily,
                    "</br>Date:", Date),
      name = "Confirmed_Daily",
      line = list(color = "black"),
      marker = list(color = "black")) %>%
  
  add_trace(
      x = ~ Date,
      y = ~ Deaths_Daily/10,
      type = "scatter",
      mode = "lines+markers",
      hoverinfo = 'text',
      text = ~paste("</br>Deaths_Daily:", Deaths_Daily,
                    "</br>Date:", Date),
      name = "Deaths_Daily",
      line = list(color = "hotpink"),
      marker = list(color = "hotpink")) %>%
  
   add_trace(data=Civiqs_Poll,
      x = ~ Date,
      y = ~ Diff,
      type = "scatter",
      mode = "lines+markers",
      hoverinfo = 'text',
      text = ~paste("<br>Diff:", Diff,
                    "</br>Dem:", Dem,
                    "</br>Rep:", Rep,
                    "</br>Date:", Date),
      name = "Diff",
      line = list(color = "steelblue"),
      marker = list(color = "steelblue")) %>% 


  layout(title = "Difference in support along with the spread of the plague",
         yaxis = list(title = "Number"), 
         xaxis = list(title = "Date"),
         legend = list(x = 0.99, y = 1),
         hovermode = "compare")

Graph explanation:

Let’s divide the tendency of diff line to 3 stages.

  1. First Stage: It can be noted that when the number of new cases and the number of deaths per day is low, the difference in opinions is high which can explain a polarization in the opinions of both parties.

  2. Second Stage: As the number of deaths per day and the verified increases, it can be seen that the difference in the opinions of the two parties decreases significantly, which can explain the common concern of the two parties regarding dealing with the party in US.

  3. Third Stage: At the end of the graph where there is a kind of stabilization in the difference between the parties that occurs along with the decrease in mortality and the decrease in the number of infected per day.

Note: we chose to divide some number just to explain a tendency and how it impact on other tendency. The real number is presented in text box.

p <- New_Total_Df %>% 
  select(Date,Total_Death_Ratio,Daily_Death_Ratio) %>%
  gather(key,value, -Date) %>% 
  ggplot(aes(x = Date, y = value, color = key)) +
  scale_x_date(limits = as.Date(c("2020-02-25","2020-04-05"))) +
    scale_y_continuous(labels=scales::percent) +
  geom_line() +
  labs(x = "Date", y = "Death ratio (%)", title = "Line plot representing Death Ratio", subtitle = "Over the USA population between dates 2020-02-25 to 2020-04-05")
ggplotly(p)%>% 
   layout(title = list(text = paste0('Line plot representing Death Ratio',
                                    '<br>',
                                    '<sup>',
                                     'Over the USA population between dates 2020-02-25 to 2020-04-05','</sup>')))

Graph explanation:

We can see that we have a steep rise in the death ratio that starts at the begging of march, that can explain the increase in concern for both parties.

We would want to check if the death cases can cause the increase in concern. We will create a plot combing the daily death and the concern ratio for both parties.


Brief description for the plot :

  1. We reset the level of concern to 0 for both parties in order to test their rise

  2. We join the concern and total_df by date to get the information only at the specific dates we have.

  3. We used gather function to do pivot longer for the 2 parties data

  4. we created a dot plot with regression presenting the increase of concern by the increase in daily death cases.

p <- Civiqs_Poll %>% 
  mutate(Dem = (Dem-35)/100, Rep = (Rep+58)/100) %>% 
  left_join(Total_Df, by = "Date") %>% 
  gather(key,value,Rep,Dem) %>% 
  ggplot(aes(y =value, x = Deaths_Daily, color = key)) +
  scale_y_continuous(labels=scales::percent) +
  geom_point() +geom_smooth(se=F) +
  labs(y = "Concern Rate", x = "Death Daily", title = "Increase in mutal concern for both parties", color = "Party")
ggplotly(p) %>% 
   layout(title = list(text = paste0('Dot plot representing the increase in concern for both parties',
                                    '<br>',
                                    '<sup>',
                                     'While daily deaths number goes up','</sup>')))

Graph explanation:

It can be seen that while the death daily count increased the concern ratio for both parties increased dramatically. The concern ratio in the Republican party has an increase of almost 80% and the Democratic party by 40%.

Note: We have turned off the comments issued by the chunk, because they are only prints of the system and irrelevant